Abstract

In this paper, a method of domain adaptation for clustered language
models is developed. It is based on a previously developed clustering
algorithm, but with a modified optimisation criterion. The results are shown
to be slightly superior to the previously published 'Fillup' method, which
can be used to adapt standard n-gram models. However, the improvement
both methods give compared to models built from scratch on the adaptation
data is quite small (less than 11% relative improvement in word
error rate). This suggests that both methods are still unsatisfactory from
a practical point of view.

1 Introduction

Current large vocabulary speech recognition systems can achieve good
performance on domains for which large quantities (e.g. millions of words) of
textual data are available to train a language model. In real world
applications, however, this is quite often not the case. The issue of language
model domain adaptation is therefore of great practical importance.

One approach to tackle this problem is to try to learn from an analogy to
the speaker dependence issue: current systems perform well by training
speaker-independent models, which can then be adapted with relatively little
data from a given speaker (see [?]). Can the same approach be applied to
language model adaptation?

In Section 2, previous work in this area is reviewed and a rough working
definition of domain is given. A method to perform domain adaptation with
clustered language models is then developed (Section 3). Experimental results
to evaluate the method are given in Section 4, followed by conclusions in
Section 5.

2 Background 

In order to make the description of domain adaptation more precise, a definition
of domain is needed. One might be tempted to define domain in the sense of
semantic topic. However, texts might differ in other aspects (e.g. style), which
could still require language model adaptation. A more general definition of
domain, more in line with the term sublanguage, is therefore required. According
to [?], there are many different definitions of the term, but most of them seem
to agree on the following characteristics of a sublanguage:

1. it is part of a natural language 

2. it is of a specialised form 

3. it behaves like a complete language 

4. it is used in special circumstances (e.g. expert communication) 

5. it is limited to a particular subject domain 

Some of these points seem very useful for the concept of domain (2,4), others 
less so (1). What properties should an acceptable definition of domain have? 
The following spring to mind: 

• there should be a continuum (e.g. an infinite number) of domains 

• each domain may contain an infinite number of elements (e.g.
documents / sentences / words)

• for a given element, one should be able to decide whether or not it belongs 
to a given domain 

• all elements of a domain should have a common feature (which defines the 
domain) 

This leads to the following rather wide working definition of domain and hence
domain adaptation: A domain D is a (often infinite) set of documents such
that each document satisfies a property P_D (e.g. 'the document deals with
some aspect of law'). Given a sample S_Back of domain D_Back (background
domain) and a sample S_Adapt of domain D_Adapt (target domain), the problem
of language model domain adaptation is to produce a language model for
D_Adapt by using S_Adapt and by carrying over some of the information contained
in S_Back.



Domain adaptation can be divided into static and dynamic domain adaptation,
depending on the time scale used to perform adaptation. Dynamic adaptation
tries to capture phenomena with a shorter time scale (e.g. topic shifts) and
is performed on line, whereas static adaptation can be used to perform a
one-time shift from one domain to another and is performed off line. Previous
work has shown improvements both by using dynamic adaptation of n-gram models
([?], [?]) and by using static adaptation of n-gram models ([?], [?]). Since
the 'Fillup' method presented in [?] gives better performance than linear
interpolation, the 'Fillup' method is used as the method of comparison for the
adaptive clustering, which will be developed in the next section.




3 Adaptive Clustering 

The task of a language model is to calculate p(w_i | c_i), the probability of
the next word being w_i given the current context c_i. Language models differ
in the way this probability is modelled and how the context c_i is defined. A
quite general model proposed in [?] makes use of a state mapping function S
and a category mapping function G. The idea behind the state mapping
S : c -> s_c = S(c) is to assign each of the large number of possible contexts
c ∈ C to one of a smaller number of context-equivalent states. Similarly, the
category mapping G : w -> g_w = G(w) assigns each of the large number of
possible words w ∈ V to one of a smaller number of categories (similar to
parts of speech). The probability of the next word is then calculated as

\[
p(w_i \mid c_i) = p(G(w_i) \mid S(c_i)) \cdot p(w_i \mid G(w_i)). \tag{1}
\]
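As an illustration of equation (1), the following minimal sketch estimates both factors by plain relative frequencies on a toy corpus. The function names, the toy mappings S and G, and the choice of the previous word as context are assumptions made for this example only; no smoothing is applied.

```python
from collections import Counter

def collect_counts(bigrams, S, G):
    """Count (state, category) pairs, states, words and categories.

    bigrams is an iterable of (context, word) pairs. In this toy example the
    context is simply the previous word (an assumption, not from the paper).
    """
    n_sg, n_s, n_w, n_g = Counter(), Counter(), Counter(), Counter()
    for c, w in bigrams:
        s, g = S[c], G[w]
        n_sg[(s, g)] += 1
        n_s[s] += 1
        n_w[w] += 1
        n_g[g] += 1
    return n_sg, n_s, n_w, n_g

def class_prob(w, c, S, G, counts):
    """Equation (1): p(w | c) = p(G(w) | S(c)) * p(w | G(w)), unsmoothed."""
    n_sg, n_s, n_w, n_g = counts
    s, g = S[c], G[w]
    return (n_sg[(s, g)] / n_s[s]) * (n_w[w] / n_g[g])

# Hypothetical toy data: hand-made state and category mappings.
S = {"the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN"}
G = {"cat": "NOUN", "dog": "NOUN", "sleeps": "VERB", "barks": "VERB"}
bigrams = [("the", "cat"), ("a", "dog"), ("the", "dog"),
           ("cat", "sleeps"), ("dog", "barks"), ("dog", "sleeps")]
counts = collect_counts(bigrams, S, G)

# "a cat" never occurs in the toy data, yet it receives probability mass
# because "a" shares a state with "the" and "cat" shares a category with "dog".
p_unseen = class_prob("cat", "a", S, G, counts)
total = sum(class_prob(w, "the", S, G, counts) for w in G)
```

The unseen bigram receives a non-zero probability purely through the shared state and category, which is the generalisation the two mappings are meant to provide.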

In order to determine S and G automatically, a clustering algorithm as shown
in Figure 1 can be used. It is a greedy, hill-climbing algorithm that moves
elements to the best available choice at any given time.

Algorithm 1: Clustering()

  start with initial clustering functions S, G
  iterate until some convergence criterion is met
    for all w ∈ V and c ∈ C
      for all g'_w ∈ G and s'_c ∈ S
        calculate the difference in the optimisation criterion when w/c
        is moved from g_w/s_c to g'_w/s'_c
      move the w/c to the g'_w/s'_c that results in the biggest improvement
      in optimisation criterion
End Clustering

Figure 1: The clustering algorithm

Based on equation (1) and on the leaving-one-out likelihood of the model
generating the training data, an optimisation criterion can be derived (see
[?] for a detailed description). Let N(e) denote the number of times event e
appeared in the training data, let B denote the smoothing parameter used for
absolute discounting and let n_0, n_1, n_+ denote the number of pairs (s, g)
that have appeared zero, one and one or more times in the training data. The
resulting optimisation criterion F (as derived in [?]) is

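\[
\begin{aligned}
F ={} & \sum_{s,g:\,N(s,g)>1} N(s,g)\,\log\bigl(N(s,g)-1-B\bigr) \\
      & + n_1 \log\left(\frac{B\,(n_+ - 1)}{n_0 + 1}\right) \\
      & - \sum_{s} N(s)\,\log\bigl(N(s)-1\bigr)
        - \sum_{g} N(g)\,\log\bigl(N(g)-1\bigr).
\end{aligned}
\tag{2}
\]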
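The criterion in equation (2) depends only on the table of (s, g) counts, so it can be evaluated directly. The sketch below is one possible reading, not the original implementation: the function name, the default B = 0.5, and the guards that skip terms whose logarithm would be undefined (counts of 1 in the unigram sums, fewer than two seen pairs) are assumptions.

```python
import math
from collections import Counter

def leaving_one_out_criterion(n_sg, num_states, num_cats, B=0.5):
    """Evaluate F of equation (2) from a dict {(s, g): count}.

    Assumptions of this sketch: 0 < B < 1, unseen pairs are those of the
    num_states * num_cats possible pairs that are missing from n_sg, and
    terms with an undefined logarithm are skipped.
    """
    n_s, n_g = Counter(), Counter()
    for (s, g), n in n_sg.items():
        n_s[s] += n
        n_g[g] += n

    n_plus = sum(1 for n in n_sg.values() if n > 0)   # pairs seen at least once
    n_1 = sum(1 for n in n_sg.values() if n == 1)     # pairs seen exactly once
    n_0 = num_states * num_cats - n_plus              # pairs never seen

    F = sum(n * math.log(n - 1 - B) for n in n_sg.values() if n > 1)
    if n_1 > 0 and n_plus > 1:
        F += n_1 * math.log(B * (n_plus - 1) / (n_0 + 1))
    F -= sum(n * math.log(n - 1) for n in n_s.values() if n > 1)
    F -= sum(n * math.log(n - 1) for n in n_g.values() if n > 1)
    return F

# Hypothetical toy counts: 2 states, 2 categories, one pair unseen.
toy_counts = {(0, 0): 3, (0, 1): 1, (1, 0): 2}
F_toy = leaving_one_out_criterion(toy_counts, num_states=2, num_cats=2)
```

A practical clustering run would not recompute F from scratch for every tentative move; it would update only the terms affected by moving one word or context.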

The basic building block in the derivation of equation (2) is the likelihood of
one event in the training corpus, as estimated from the training corpus from
which this one event has been removed (leaving-one-out likelihood). The main
idea behind the adaptive clustering is to use as basic building block the
likelihood of one event in S_Adapt, as estimated from a linear interpolation of
counts from S_Back and S_Adapt from which this one event has been removed. The
motivation for this is that the clustering can thus optimise the perplexity on
S_Adapt, while having access to a linear combination of counts from S_Back and
S_Adapt.

Let N_A(e) (N_B(e)) denote the number of times event e appeared in S_Adapt
(S_Back). Define N_C(e) to be the linear interpolation of the two counts

\[
N_C(e) = \mathrm{Round}\bigl(\lambda\,N_A(e) + (1-\lambda)\,N_B(e)\bigr) \tag{3}
\]

where Round(x) returns the integer nearest to x. The only events that can
contribute to the optimisation function are events that occur at least once in
S_Adapt (because, as explained above, the likelihood of S_Adapt is taken as
optimisation function). However, their probability is calculated based on the
combined counts. Therefore, the smoothing has to apply to the combined counts.
Define n_{b,0}, n_{b,1}, n_{b,+} as the number of pairs (s, g) that have a
combined count N_C(s, g) of 0, of 1, and larger than 0. In order to introduce
absolute discounting for the unigram estimates as well, also define n_{s,0},
n_{s,1}, n_{s,+} as the number of states s that have a combined count N_C(s)
of 0, of 1, and larger than 0 (similarly, define n_{g,0} etc. for the unigram
estimates involving g). Changing equation (2) according to the basic idea
outlined above, this leads to

\[
\begin{aligned}
F_{adapt} ={} & \sum_{s,g:\,N_A(s,g)\ge 1,\,N_C(s,g)>1}
                N_A(s,g)\,\log\bigl(N_C(s,g)-1-B\bigr) \\
  & + n_{b,1} \log\left(\frac{B\,(n_{b,+} - 1)}{n_{b,0} + 1}\right) \\
  & - \sum_{s:\,N_A(s)\ge 1,\,N_C(s)>1} N_A(s)\,\log\bigl(N_C(s)-1-B\bigr) \\
  & - \sum_{g:\,N_A(g)\ge 1,\,N_C(g)>1} N_A(g)\,\log\bigl(N_C(g)-1-B\bigr) \\
  & - n_{s,1} \log\left(\frac{B\,(n_{s,+} - 1)}{n_{s,0} + 1}\right)
    - n_{g,1} \log\left(\frac{B\,(n_{g,+} - 1)}{n_{g,0} + 1}\right).
\end{aligned}
\tag{4}
\]
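To make equations (3) and (4) concrete, the following sketch combines adaptation and background counts and evaluates the adapted criterion term by term. It is only an illustration under stated assumptions: all function names are hypothetical, ties in Round are rounded up, B defaults to 0.5, the combined unigram counts are obtained by applying equation (3) to the unigram counts themselves, and discount terms with an undefined logarithm are set to zero.

```python
import math
from collections import Counter

def nearest_int(x):
    # Round(x) of equation (3); rounding ties upwards is an assumption.
    return math.floor(x + 0.5)

def combine(n_a, n_b, lam):
    """Equation (3): combined count N_C(e) for every event seen in either sample."""
    return {e: nearest_int(lam * n_a.get(e, 0) + (1 - lam) * n_b.get(e, 0))
            for e in set(n_a) | set(n_b)}

def marginals(n_sg):
    n_s, n_g = Counter(), Counter()
    for (s, g), n in n_sg.items():
        n_s[s] += n
        n_g[g] += n
    return n_s, n_g

def data_term(n_a, n_c, B):
    # Sum over events with N_A(e) >= 1 and N_C(e) > 1 of N_A(e) * log(N_C(e) - 1 - B).
    return sum(na * math.log(n_c[e] - 1 - B)
               for e, na in n_a.items() if na >= 1 and n_c[e] > 1)

def discount_term(n_c, universe_size, B):
    # n_1 * log(B * (n_+ - 1) / (n_0 + 1)), computed on the combined counts.
    n_plus = sum(1 for n in n_c.values() if n > 0)
    n_1 = sum(1 for n in n_c.values() if n == 1)
    n_0 = universe_size - n_plus
    if n_1 == 0 or n_plus < 2:      # guard against log(0): assumption of this sketch
        return 0.0
    return n_1 * math.log(B * (n_plus - 1) / (n_0 + 1))

def f_adapt(a_sg, b_sg, lam, num_states, num_cats, B=0.5):
    """Evaluate equation (4) from adaptation counts a_sg and background counts b_sg."""
    a_s, a_g = marginals(a_sg)
    b_s, b_g = marginals(b_sg)
    c_sg = combine(a_sg, b_sg, lam)
    c_s = combine(a_s, b_s, lam)
    c_g = combine(a_g, b_g, lam)
    return (data_term(a_sg, c_sg, B)
            + discount_term(c_sg, num_states * num_cats, B)
            - data_term(a_s, c_s, B)
            - data_term(a_g, c_g, B)
            - discount_term(c_s, num_states, B)
            - discount_term(c_g, num_cats, B))

# Hypothetical toy counts: one state, two categories.
adapt_counts = {(0, 0): 2}
back_counts = {(0, 0): 4, (0, 1): 2}
F_toy = f_adapt(adapt_counts, back_counts, lam=0.5, num_states=1, num_cats=2)
```

Note that the weights in the sums come from the adaptation sample only, while the arguments of the logarithms and all discounting statistics come from the combined counts.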

By using the same clustering algorithm as before, but with F_adapt instead
of F as optimisation criterion, language model domain adaptation can be
performed.
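Because only the criterion changes, the greedy exchange loop of Figure 1 can be written once with the criterion passed in as a function. The sketch below is a simplified illustration: it clusters a single set of items rather than words and contexts jointly, it recomputes the criterion for every tentative move, and the toy criterion (negative within-cluster squared deviation of made-up numbers) is a stand-in chosen for testability, not F or F_adapt.

```python
def greedy_exchange(assignment, num_clusters, criterion, max_iters=20):
    """Greedy hill-climbing: move each item to the cluster that most improves
    the criterion, until a full pass makes no move (or max_iters is reached)."""
    assignment = dict(assignment)
    best = criterion(assignment)
    for _ in range(max_iters):
        moved = False
        for item in sorted(assignment):
            original = assignment[item]
            best_target = original
            for target in range(num_clusters):
                if target == original:
                    continue
                assignment[item] = target          # tentative move
                score = criterion(assignment)
                if score > best:
                    best, best_target = score, target
            assignment[item] = best_target         # keep the best move, if any
            moved = moved or best_target != original
        if not moved:
            break
    return assignment, best

# Hypothetical stand-in criterion: higher is better, like F and F_adapt.
VALUES = {"a": 1.0, "b": 1.1, "c": 5.0, "d": 5.2}

def toy_criterion(assignment):
    clusters = {}
    for item, k in assignment.items():
        clusters.setdefault(k, []).append(VALUES[item])
    total = 0.0
    for vals in clusters.values():
        mean = sum(vals) / len(vals)
        total += sum((v - mean) ** 2 for v in vals)
    return -total

start = {"a": 0, "b": 1, "c": 0, "d": 1}
final, score = greedy_exchange(start, num_clusters=2, criterion=toy_criterion)
```

Switching from unadapted to adapted clustering then amounts to passing a different `criterion` callable, with the initial assignment taken from a previous clustering.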

4 Results 

In order to test different adaptation methods, two textual samples S_Back and
S_Adapt and acoustic testing data from D_Adapt are required. Since the WSJ
domain has the associated acoustic data, it is used as D_Adapt. As D_Back, the
patent domain (PAT) was chosen, for which a large sample S_Back (about 35
million words are used) is also available from the LDC as part of the TIPSTER
database.

The recognition system is a state-of-the-art HMM-based system (continuous
densities, mixtures, triphones). All experiments are based on bigram language
models, either clustered (500 clusters) or backoff (singleton bigrams were
ignored). The different methods evaluated were:

• Back_Bo: a backoff model built on the background corpus

• Back_Cl: a clustered model built on the background corpus

• Adapt_Cl: a clustered model built on the adaptation data

• Adapt_Bo: a backoff model built on the adaptation data

• Fillup: a model built according to the 'Fillup' method presented in [?]

• ClustAdapt: a model built with the adaptive clustering presented in the
previous section; the initial starting point for the clustering is taken to be
the clustering produced by Back_Cl; one global λ parameter was used and
optimised iteratively at the end of each iteration

For all methods except Back_Bo and Back_Cl, the vocabulary was defined to
be all the words that appeared in S_Adapt, plus additional words from S_Back
until 20K words were reached. For Back_Bo and Back_Cl, the vocabulary
consisted of the 20K most frequent words in S_Back. Because of this difference,
the perplexities of Back_Bo and Back_Cl are not directly comparable to those of
the other models. For each method and a given amount of adaptation material, the
perplexity of the resulting model was calculated on a held-out section of S_Adapt
and a recognition run was performed on the acoustic data.
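The vocabulary construction described above can be sketched as follows. The function name is hypothetical, and the text does not say in which order the additional background words are taken; this sketch assumes they are added by decreasing frequency in S_Back, and it keeps every adaptation word even if there were more than the target size.

```python
from collections import Counter

def build_vocabulary(adapt_tokens, back_tokens, size=20000):
    """All distinct words of the adaptation sample, filled up with the most
    frequent background words (assumed order) until `size` is reached."""
    vocab = list(dict.fromkeys(adapt_tokens))   # distinct words, first-seen order
    seen = set(vocab)
    for word, _count in Counter(back_tokens).most_common():
        if len(vocab) >= size:
            break
        if word not in seen:
            vocab.append(word)
            seen.add(word)
    return vocab

# Hypothetical toy samples with a target size of 4 words.
toy_vocab = build_vocabulary(
    ["stocks", "fell", "stocks"],
    ["the", "patent", "the", "claims", "patent", "the"],
    size=4,
)
```

With very little adaptation text, most of the vocabulary therefore comes from the background domain.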
Model      PP     WER (%)
Back_Cl    955    49.3
Back_Bo    954    48.4

Table 1: Baseline results



Adapt. words    PP      WER (%)
200             6130    57.0
1000            2740    54.0
5000            1740    47.6
25000           966     39.2
125000          593     33.0

Table 2: Results for Adapt_Bo



Table 1 gives the results of the two baseline methods, which do not use
any of the adaptation material. The high perplexities show that the PAT and
WSJ domains are considerably different. The rate of out-of-vocabulary words
is about 15%, which is one reason for the very high error rate.
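For reference, the two quantities mentioned here can be computed as in the sketch below. Both function names are hypothetical, and the perplexity function assumes that natural-log probabilities of the held-out tokens are already available from some language model.

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities of a held-out text."""
    return math.exp(-sum(log_probs) / len(log_probs))

def oov_rate(tokens, vocabulary):
    """Fraction of running words that are not covered by the vocabulary."""
    unknown = sum(1 for t in tokens if t not in vocabulary)
    return unknown / len(tokens)

# Toy check: a model assigning probability 1/8 to every token has perplexity 8.
pp_uniform = perplexity([math.log(1 / 8)] * 4)
rate = oov_rate(["a", "b", "c", "d"], {"a", "b", "c"})
```

Every out-of-vocabulary word is necessarily misrecognised, so the out-of-vocabulary rate puts a floor under the word error rate.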

Tables 2, 3, 4 and 5 give the results for the different methods and different
amounts of adaptation material.

Comparing Table 3 to Table 2, one can see that Adapt_Cl is more robust
than Adapt_Bo and that it leads to better recognition results for almost the
entire range of adaptation material. This is consistent with previous results
(see [?]), which showed that clustered models are more robust in terms of
perplexity.

Comparing Table 4 to Table 3, one can see that Fillup outperforms Adapt_Cl
in almost all cases.

By looking at Table 5, one can see that ClustAdapt outperforms Fillup in
word error rate in almost all cases, although its perplexities are higher.

Finally, when comparing Table 5 to Table 3, one can see that the relative
improvements in word error rate from using ClustAdapt instead of Adapt_Cl are
10.7%, 5.87%, 3.45%, -2.70% and 1.80%.
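These figures follow from the word error rates reported for Adapt_Cl and ClustAdapt, with the relative improvement taken as (baseline - new) / baseline. The short sketch below reproduces the arithmetic; the function and variable names are chosen for this example.

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent; negative values mean a degradation."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

adapt_words = [200, 1000, 5000, 25000, 125000]
adapt_cl_wer = [57.0, 51.1, 46.4, 37.0, 33.4]      # WER (%) of Adapt_Cl
clust_adapt_wer = [50.9, 48.1, 44.8, 38.0, 32.8]   # WER (%) of ClustAdapt

improvements = [relative_improvement(b, n)
                for b, n in zip(adapt_cl_wer, clust_adapt_wer)]

for words, imp in zip(adapt_words, improvements):
    print(f"{words:>6} words: {imp:+.2f}%")
```

The largest gain is obtained with the smallest amount of adaptation material, and the gain shrinks or turns slightly negative as more data becomes available.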



Adapt. words    PP      WER (%)
200             4170    57.0
1000            2150    51.1
5000            1210    46.4
25000           765     37.0
125000          498     33.4

Table 3: Results for Adapt_Cl


Adapt. words    PP      WER (%)
200             848     49.9
1000            772     49.6
5000            628     44.9
25000           543     38.2
125000          420     33.0

Table 4: Results for Fillup


Adapt. words    PP      WER (%)
200             941     50.9
1000            1040    48.1
5000            821     44.8
25000           801     38.0
125000          623     32.8

Table 5: Results for ClustAdapt



5 Conclusion 

Compared to the success of some methods for acoustic adaptation, the results
obtained here are somewhat disappointing. In particular, they seem to suggest
that the improvements from the adaptation techniques compared to starting
from scratch on the adaptation data become quite small when several tens of
thousands of words are available. One reason for this could be the fact that the
acoustic space has an underlying distance metric and thus allows the comparison
of two elements. Moreover, one can specify the kind of transformations one
would want the adaptation to be able to perform. Both of these points seem
more difficult in the case of language model adaptation.

Even though the adaptation method for clustered language models developed
in this paper gives slightly better results than the 'Fillup' method, the
accuracies obtained with the adaptive clustering and the 'Fillup' method are
still very low compared to the roughly 80% or more that the system can achieve
with a backoff bigram trained on about 40 million words of the WSJ corpus. Both
adaptation methods are therefore still unsatisfactory from a practical point of
view.